GVPT Maths Boot Camp

Data Visualisation

Learning objectives for today

  1. Create your first plot in R

  2. Test your hypotheses using informative data visualisations

The Research Process

Source: R4DS

R basics

R code:

1 + 2
[1] 3

Functions:

sum(1, 2)
[1] 3
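
A function call passes its arguments inside parentheses, and the result can be stored in an object with <-. The sketch below uses only base R; the object names total, values and average are illustrative and do not come from the slides.

```r
# Store the result of a function call in an object
total <- sum(1, 2)

# Pass a vector of numbers to another base R function
values <- c(1, 2, 3)
average <- mean(values)

# Typing an object's name prints its value
total
average
```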

R packages

Packages are collections of R functions, data, and compiled code in a well-defined format.

# Install the relevant package(s)
install.packages("tidyverse")

# Load the package in current session
library(tidyverse)
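
install.packages() only needs to run once per machine, whereas library() is needed in every new session. One possible way to skip the install step when a package is already present is sketched below using base R's requireNamespace(); the helper name install_if_missing is hypothetical.

```r
# Hypothetical helper: install a package only if it is not already available.
# Returns TRUE (invisibly) when an install was attempted, FALSE otherwise.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))
  }
  install.packages(pkg)
  invisible(TRUE)
}
```

Calling install_if_missing("tidyverse") before library(tidyverse) would then leave an existing installation untouched.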

Data visualisation

From R4DS - Data Visualization:

Do cars with big engines use more fuel than cars with small engines?

Skipping to the end

Load relevant packages and data

# Load the relevant functions
library(tidyverse)

# Load the data
mpg
manufacturer model displ year cyl
audi a4 1.8 1999 4
audi a4 1.8 1999 4
audi a4 2.0 2008 4
audi a4 2.0 2008 4
audi a4 2.8 1999 6
audi a4 2.8 1999 6

Learning more about the data

To get help on any function or dataset, type ? followed by its name.

For example, to learn more about this dataset, type ?mpg into your console.
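
Besides the help page, a few base R functions give a quick look at the data itself. This is a minimal sketch, assuming ggplot2 is installed (mpg ships with it).

```r
library(ggplot2)  # provides the mpg dataset

names(mpg)        # column names
head(mpg, 3)      # first three rows
summary(mpg$hwy)  # minimum, quartiles, mean, and maximum of highway mpg
```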

Plot your data

library(ggplot2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
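
The same plot can be written more briefly by relying on argument order, since ggplot() takes the data first and the mapping second. The sketch below shows both forms side by side; the object names p_long and p_short are illustrative.

```r
library(ggplot2)

# Explicit argument names, as in the code above
p_long <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Shorter form relying on argument order: data first, then mapping
p_short <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

p_short
```

The explicit form is easier to read while learning; the short form is common in code you will find online.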

EXERCISES

What do you see when you run the following?

ggplot(data = mpg)

How many rows are in mpg? How many columns?

nrow(mpg)
ncol(mpg)

What does the drv variable describe?

?mpg

Make a scatterplot of hwy vs cyl.

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

Let’s look at groups in the data

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Let’s look at groups in the data

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Let’s look at groups in the data

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

Let’s make this prettier

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
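
Other fixed properties can be set the same way as color. As a rough sketch, the values for alpha and size below are arbitrary choices, not recommendations from the slides.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue",
    alpha = 0.5,  # semi-transparent, so overlapping points show through
    size = 2      # same point size for every observation
  )

p
```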

Let’s add useful headings

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue") + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

Let’s clean this up

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue") + 
  theme_minimal() + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

Creating your own theme

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon by class",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    color = "Class"
  )
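
A set of theme choices can also be stored in an object and reused across plots. The sketch below assumes a starting point of theme_minimal(); the name theme_bootcamp is hypothetical.

```r
library(ggplot2)

# Hypothetical name for a reusable theme object
theme_bootcamp <- theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Add the stored theme to any plot like a built-in theme
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  theme_bootcamp
```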

EXERCISES

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

EXERCISES

Which variables in mpg are categorical? Which variables are continuous?


Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?


What happens if you map the same variable to multiple aesthetics?

EXERCISES

What does the stroke aesthetic do? What shapes does it work with?


What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

Summarising relationships in the data

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
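
The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of hwy on displ. To see the numbers behind it, the same model can be fitted directly with base R's lm(); this is a minimal sketch, and the object name fit is illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(hwy ~ displ, data = mpg)

coef(fit)     # intercept and slope
summary(fit)  # standard errors, p-values, R-squared
```

The sign of the slope on displ is what speaks to the question of whether cars with bigger engines use more fuel.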

Group-specific relationships

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_minimal()
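
An alternative to colouring by group is to give each group its own panel with facet_wrap(). This is a sketch of one option rather than part of the original slides; the object name p_facets is illustrative.

```r
library(ggplot2)

# One panel per drive type, each with its own fitted line
p_facets <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ drv) +
  theme_minimal()

p_facets
```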

Summary

Today you:

  1. Set up your data science tools

  2. Plotted complex data in an engaging way

  3. Discovered interesting relationships in the data

  4. Connected these relationships or trends to your expectations (or hypotheses about the data)